statistics Statistics (from German language, German: ''wikt:Statistik#German, Statistik'', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of ...

, classification is the problem of identifying which of a set of

categories Category, plural categories, may refer to: Philosophy and general uses *Categorization, categories in cognitive science, information science and generally *Category of being * ''Categories'' (Aristotle) *Category (Kant) * Categories (Peirce) * ...

(sub-populations) an

observation Observation is the active acquisition of information from a primary source. In living beings, observation employs the senses. In science, observation can also involve the perception and recording of data via the use of scientific instruments. The ...

(or observations) belongs to. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient (sex, blood pressure, presence or absence of certain symptoms, etc.). Often, the individual observations are analyzed into a set of quantifiable properties, known variously as explanatory variables or ''features''. These properties may variously be categorical (e.g. "A", "B", "AB" or "O", for

blood type A blood type (also known as a blood group) is a classification of blood, based on the presence and absence of antibodies and inherited antigenic substances on the surface of red blood cells (RBCs). These antigens may be proteins, carbohydrate ...

), ordinal (e.g. "large", "medium" or "small"), integer-valued (e.g. the number of occurrences of a particular word in an

email Electronic mail (email or e-mail) is a method of exchanging messages ("mail") between people using electronic devices. Email was thus conceived as the electronic ( digital) version of, or counterpart to, mail, at a time when "mail" meant ...

) or

real-valued In mathematics, value may refer to several, strongly related notions. In general, a mathematical value may be any definite mathematical object. In elementary mathematics, this is most often a number – for example, a real number such as or an i ...

(e.g. a measurement of

blood pressure Blood pressure (BP) is the pressure of circulating blood against the walls of blood vessels. Most of this pressure results from the heart pumping blood through the circulatory system. When used without qualification, the term "blood pressure" r ...

). Other classifiers work by comparing observations to previous observations by means of a similarity or

distance Distance is a numerical or occasionally qualitative measurement of how far apart objects or points are. In physics or everyday usage, distance may refer to a physical length or an estimation based on other criteria (e.g. "two counties over"). ...

function. An

algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algorithms are used as specificat ...

that implements classification, especially in a concrete implementation, is known as a classifier. The term "classifier" sometimes also refers to the mathematical

function Function or functionality may refer to: Computing * Function key, a type of key on computer keyboards * Function model, a structured representation of processes in a system * Function object or functor or functionoid, a concept of object-oriente ...

, implemented by a classification algorithm, that maps input data to a category. Terminology across fields is quite varied. In

, where classification is often done with

logistic regression In statistics, the logistic model (or logit model) is a statistical model that models the probability of an event taking place by having the log-odds for the event be a linear function (calculus), linear combination of one or more independent var ...

or a similar procedure, the properties of observations are termed

explanatory variable Dependent and independent variables are variables in mathematical modeling, statistical modeling and experimental sciences. Dependent variables receive this name because, in an experiment, their values are studied under the supposition or deman ...

s (or independent variables, regressors, etc.), and the categories to be predicted are known as outcomes, which are considered to be possible values of the

dependent variable Dependent and independent variables are variables in mathematical modeling, statistical modeling and experimental sciences. Dependent variables receive this name because, in an experiment, their values are studied under the supposition or demand ...

. In

machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...

, the observations are often known as ''instances'', the explanatory variables are termed ''features'' (grouped into a

feature vector In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a phenomenon. Choosing informative, discriminating and independent features is a crucial element of effective algorithms in pattern r ...

), and the possible categories to be predicted are ''classes''. Other fields may use different terminology: e.g. in

community ecology In ecology, a community is a group or association of populations of two or more different species occupying the same geographical area at the same time, also known as a biocoenosis, biotic community, biological community, ecological community, ...

, the term "classification" normally refers to

cluster analysis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of ...

Relation to other problems

Classification Classification is a process related to categorization, the process in which ideas and objects are recognized, differentiated and understood. Classification is the grouping of related facts into classes. It may also refer to: Business, organizat ...

and clustering are examples of the more general problem of

pattern recognition Pattern recognition is the automated recognition of patterns and regularities in data. It has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphi ...

, which is the assignment of some sort of output value to a given input value. Other examples are regression, which assigns a real-valued output to each input;

sequence labeling In machine learning, sequence labeling is a type of pattern recognition task that involves the algorithmic assignment of a categorical label to each member of a sequence of observed values. A common example of a sequence labeling task is part of ...

, which assigns a class to each member of a sequence of values (for example,

part of speech tagging In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definit ...

, which assigns a

part of speech In grammar, a part of speech or part-of-speech (abbreviated as POS or PoS, also known as word class or grammatical category) is a category of words (or, more generally, of lexical items) that have similar grammatical properties. Words that are assi ...

to each word in an input sentence);

parsing Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term ''parsing'' comes from L ...

, which assigns a parse tree to an input sentence, describing the syntactic structure of the sentence; etc. A common subclass of classification is

probabilistic classification In machine learning, a probabilistic classifier is a classifier that is able to predict, given an observation of an input, a probability distribution over a set of classes, rather than only outputting the most likely class that the observation sho ...

. Algorithms of this nature use

statistical inference Statistical inference is the process of using data analysis to infer properties of an underlying probability distribution, distribution of probability.Upton, G., Cook, I. (2008) ''Oxford Dictionary of Statistics'', OUP. . Inferential statistical ...

to find the best class for a given instance. Unlike other algorithms, which simply output a "best" class, probabilistic algorithms output a

probability Probability is the branch of mathematics concerning numerical descriptions of how likely an Event (probability theory), event is to occur, or how likely it is that a proposition is true. The probability of an event is a number between 0 and ...

of the instance being a member of each of the possible classes. The best class is normally then selected as the one with the highest probability. However, such an algorithm has numerous advantages over non-probabilistic classifiers: *It can output a confidence value associated with its choice (in general, a classifier that can do this is known as a ''confidence-weighted classifier''). *Correspondingly, it can ''abstain'' when its confidence of choosing any particular output is too low. *Because of the probabilities which are generated, probabilistic classifiers can be more effectively incorporated into larger machine-learning tasks, in a way that partially or completely avoids the problem of ''error propagation''.

Frequentist procedures

Early work on statistical classification was undertaken by

Fisher Fisher is an archaic term for a fisherman, revived as gender-neutral. Fisher, Fishers or The Fisher may also refer to: Places Australia *Division of Fisher, an electoral district in the Australian House of Representatives, in Queensland *Elect ...

, in the context of two-group problems, leading to

Fisher's linear discriminant Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics and other fields, to find a linear combination of features ...

function as the rule for assigning a group to a new observation.Gnanadesikan, R. (1977) ''Methods for Statistical Data Analysis of Multivariate Observations'', Wiley. (p. 83–86) This early work assumed that data-values within each of the two groups had a

multivariate normal distribution In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional ( univariate) normal distribution to higher dimensions. One ...

. The extension of this same context to more than two-groups has also been considered with a restriction imposed that the classification rule should be

linear Linearity is the property of a mathematical relationship (''function'') that can be graphically represented as a straight line. Linearity is closely related to '' proportionality''. Examples in physics include rectilinear motion, the linear r ...

. Later work for the multivariate normal distribution allowed the classifier to be

nonlinear In mathematics and science, a nonlinear system is a system in which the change of the output is not proportional to the change of the input. Nonlinear problems are of interest to engineers, biologists, physicists, mathematicians, and many othe ...

: several classification rules can be derived based on different adjustments of the Mahalanobis distance, with a new observation being assigned to the group whose centre has the lowest adjusted distance from the observation.

Bayesian procedures

Unlike frequentist procedures, Bayesian classification procedures provide a natural way of taking into account any available information about the relative sizes of the different groups within the overall population. Bayesian procedures tend to be computationally expensive and, in the days before

Markov chain Monte Carlo In statistics, Markov chain Monte Carlo (MCMC) methods comprise a class of algorithms for sampling from a probability distribution. By constructing a Markov chain that has the desired distribution as its equilibrium distribution, one can obtain ...

computations were developed, approximations for Bayesian clustering rules were devised. Some Bayesian procedures involve the calculation of group membership probabilities: these provide a more informative outcome than a simple attribution of a single group-label to each new observation.

Binary and multiclass classification

Classification can be thought of as two separate problems –

binary classification Binary classification is the task of classifying the elements of a set into two groups (each called ''class'') on the basis of a classification rule. Typical binary classification problems include: * Medical testing to determine if a patient has c ...

and

multiclass classification In machine learning and statistical classification, multiclass classification or multinomial classification is the problem of classifying instances into one of three or more classes (classifying instances into one of two classes is called binary c ...

. In binary classification, a better understood task, only two classes are involved, whereas multiclass classification involves assigning an object to one of several classes. Since many classification methods have been developed specifically for binary classification, multiclass classification often requires the combined use of multiple binary classifiers.

Feature vectors

Most algorithms describe an individual instance whose category is to be predicted using a

of individual, measurable properties of the instance. Each property is termed a

feature Feature may refer to: Computing * Feature (CAD), could be a hole, pocket, or notch * Feature (computer vision), could be an edge, corner or blob * Feature (software design) is an intentional distinguishing characteristic of a software item ...

, also known in statistics as an

(or independent variable, although features may or may not be

statistically independent Independence is a fundamental notion in probability theory, as in statistics and the theory of stochastic processes. Two events are independent, statistically independent, or stochastically independent if, informally speaking, the occurrence of o ...

). Features may variously be

binary Binary may refer to: Science and technology Mathematics * Binary number, a representation of numbers using only two digits (0 and 1) * Binary function, a function that takes two arguments * Binary operation, a mathematical operation that ta ...

(e.g. "on" or "off"); categorical (e.g. "A", "B", "AB" or "O", for

); ordinal (e.g. "large", "medium" or "small"); integer-valued (e.g. the number of occurrences of a particular word in an email); or

(e.g. a measurement of blood pressure). If the instance is an image, the feature values might correspond to the pixels of an image; if the instance is a piece of text, the feature values might be occurrence frequencies of different words. Some algorithms work only in terms of discrete data and require that real-valued or integer-valued data be ''discretized'' into groups (e.g. less than 5, between 5 and 10, or greater than 10).

Linear classifiers

A large number of

s for classification can be phrased in terms of a

linear function In mathematics, the term linear function refers to two distinct but related notions: * In calculus and related areas, a linear function is a function whose graph is a straight line, that is, a polynomial function of degree zero or one. For dist ...

that assigns a score to each possible category ''k'' by combining the feature vector of an instance with a vector of weights, using a

dot product In mathematics, the dot product or scalar productThe term ''scalar product'' means literally "product with a scalar as a result". It is also used sometimes for other symmetric bilinear forms, for example in a pseudo-Euclidean space. is an algebra ...

. The predicted category is the one with the highest score. This type of score function is known as a

linear predictor function In statistics and in machine learning, a linear predictor function is a linear function ( linear combination) of a set of coefficients and explanatory variables (independent variables), whose value is used to predict the outcome of a dependent vari ...

and has the following general form:

\operatorname(\mathbf_i, k) = \boldsymbol\beta_k \cdot \mathbf_i,

where X_''i'' is the feature vector for instance ''i'', β_''k'' is the vector of weights corresponding to category ''k'', and score(X_''i'', ''k'') is the score associated with assigning instance ''i'' to category ''k''. In

discrete choice In economics, discrete choice models, or qualitative choice models, describe, explain, and predict choices between two or more discrete alternatives, such as entering or not entering the labor market, or choosing between modes of transport. Such ...

theory, where instances represent people and categories represent choices, the score is considered the

utility As a topic of economics, utility is used to model worth or value. Its usage has evolved significantly over time. The term was introduced initially as a measure of pleasure or happiness as part of the theory of utilitarianism by moral philosopher ...

associated with person ''i'' choosing category ''k''. Algorithms with this basic setup are known as

linear classifier In the field of machine learning, the goal of statistical classification is to use an object's characteristics to identify which class (or group) it belongs to. A linear classifier achieves this by making a classification decision based on the val ...

s. What distinguishes them is the procedure for determining (training) the optimal weights/coefficients and the way that the score is interpreted. Examples of such algorithms include * ** * * The

perceptron In machine learning, the perceptron (or McCulloch-Pitts neuron) is an algorithm for supervised learning of binary classifiers. A binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belon ...

algorithm * *

Algorithms

Since no single form of classification is appropriate for all data sets, a large toolkit of classification algorithms have been developed. The most commonly used include: * * * ** * ** ** ** * ** * * ** ** ** ** * * **

Evaluation

Classifier performance depends greatly on the characteristics of the data to be classified. There is no single classifier that works best on all given problems (a phenomenon that may be explained by the no-free-lunch theorem). Various empirical tests have been performed to compare classifier performance and to find the characteristics of data that determine classifier performance. Determining a suitable classifier for a given problem is however still more an art than a science. The measures precision and recall are popular metrics used to evaluate the quality of a classification system. More recently,

receiver operating characteristic A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The method was originally developed for operators of m ...

(ROC) curves have been used to evaluate the tradeoff between true- and false-positive rates of classification algorithms. As a performance metric, the

uncertainty coefficient In statistics, the uncertainty coefficient, also called proficiency, entropy coefficient or Theil's U, is a measure of nominal association. It was first introduced by Henri Theil and is based on the concept of information entropy. Definition S ...

has the advantage over simple

accuracy Accuracy and precision are two measures of ''observational error''. ''Accuracy'' is how close a given set of measurements ( observations or readings) are to their ''true value'', while ''precision'' is how close the measurements are to each oth ...

in that it is not affected by the relative sizes of the different classes. Further, it will not penalize an algorithm for simply ''rearranging'' the classes.

Application domains

Classification has many applications. In some of these it is employed as a data mining procedure, while in others more detailed statistical modeling is undertaken. * * identification * ** Medical image analysis and ** ** * * * Drug discovery and ** ** * * * Internet * Micro-array classification * * * *

References

{{DEFAULTSORT:Statistical Classification *